The Annotation & Quality Assurance subsystem in KAZU provides tools for managing and evaluating data annotations, primarily through Label Studio. It converts KAZU's internal data structures to and from Label Studio formats so that documents can move between the pipeline and the annotation interface. It also includes an acceptance testing framework that checks the quality, consistency, and accuracy of both human annotations and the output of the KAZU pipeline.
Components
Training and Evaluation Manager
This component orchestrates the end-to-end training and evaluation workflows within KAZU. It is responsible for loading models, setting up and executing the document processing pipeline, calculating performance metrics, and integrating with Label Studio for visualization and review of evaluation results.
Label Studio Data Conversion & Management
This component handles all interactions with the Label Studio annotation platform. Its responsibilities include converting KAZU's internal document and entity formats to Label Studio tasks, converting Label Studio annotations back into KAZU's data models, and managing projects, views, and tasks within the Label Studio API.
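The round trip described above can be sketched roughly as follows. This is a minimal illustration, not KAZU's converter: SimpleEntity, to_label_studio_task, and from_label_studio_annotation are hypothetical names, and the "label" and "text" control names are assumptions that would have to match the project's labeling configuration. Only the general shape of a Label Studio task (a "data" payload plus "predictions" or "annotations" holding "result" regions) is taken from Label Studio's JSON format.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SimpleEntity:
    """Hypothetical stand-in for an entity; not KAZU's actual Entity class."""
    start: int
    end: int  # exclusive character offset
    label: str


def to_label_studio_task(text: str, entities: list[SimpleEntity]) -> dict[str, Any]:
    """Build a Label Studio task with the entities attached as pre-annotations."""
    results = [
        {
            # Assumed control names; they must match the labeling config in use.
            "from_name": "label",
            "to_name": "text",
            "type": "labels",
            "value": {
                "start": e.start,
                "end": e.end,
                "text": text[e.start:e.end],
                "labels": [e.label],
            },
        }
        for e in entities
    ]
    return {"data": {"text": text}, "predictions": [{"result": results}]}


def from_label_studio_annotation(task: dict[str, Any]) -> list[SimpleEntity]:
    """Read span regions back out of an exported, annotated task."""
    entities = []
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            if region.get("type") != "labels":
                continue  # skip relations, choices, and other region types
            value = region["value"]
            for label in value["labels"]:
                entities.append(SimpleEntity(value["start"], value["end"], label))
    return entities
```

Keeping character offsets as the shared key in both directions means the text itself never has to be re-tokenized when annotations come back from reviewers.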
Acceptance Testing Framework
This component provides a framework for evaluating the quality and consistency of annotations and the overall performance of the KAZU pipeline. It scores sections, aggregates NER and linking results, and checks whether those results meet predefined acceptance thresholds.
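To make the idea of acceptance thresholds concrete, here is a small sketch of per-class scoring followed by a threshold check. It assumes exact-match comparison of (start, end, class) spans and a thresholds dictionary keyed by entity class; both are illustrative choices, and score_by_class and failed_criteria are hypothetical helpers rather than functions from the KAZU codebase.

```python
from collections import Counter

Span = tuple[int, int, str]  # (start, end, entity_class)


def score_by_class(gold: set[Span], predicted: set[Span]) -> dict[str, dict[str, float]]:
    """Compute exact-match precision and recall for each entity class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in predicted:
        # A prediction counts as correct only if an identical gold span exists.
        (tp if span in gold else fp)[span[2]] += 1
    for span in gold - predicted:
        fn[span[2]] += 1

    scores = {}
    for cls in set(tp) | set(fp) | set(fn):
        precision_denominator = tp[cls] + fp[cls]
        recall_denominator = tp[cls] + fn[cls]
        scores[cls] = {
            "precision": tp[cls] / precision_denominator if precision_denominator else 0.0,
            "recall": tp[cls] / recall_denominator if recall_denominator else 0.0,
        }
    return scores


def failed_criteria(
    scores: dict[str, dict[str, float]],
    thresholds: dict[str, dict[str, float]],
) -> list[str]:
    """Return one message per metric that falls below its minimum."""
    failures = []
    for cls, minimums in thresholds.items():
        for metric, minimum in minimums.items():
            # A class with no scores at all is treated as scoring zero.
            actual = scores.get(cls, {}).get(metric, 0.0)
            if actual < minimum:
                failures.append(f"{cls} {metric} {actual:.2f} < {minimum:.2f}")
    return failures
```

An empty list from failed_criteria would mean every configured class met its thresholds; a test harness could assert on that directly.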
NER Processing Steps
This component encapsulates the specific steps involved in Named Entity Recognition (NER) within the KAZU pipeline. It primarily focuses on using Hugging Face Transformers models for token classification and processing tokenized words to identify and classify entities.
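A token classification model emits one label per token, so some aggregation is needed to turn those labels into entities. The sketch below assumes a BIO tagging scheme and tokens that already carry character offsets; bio_to_spans is a hypothetical helper, and KAZU's own steps may handle subword tokens and label schemes differently.

```python
def bio_to_spans(
    tokens: list[tuple[str, int, int]],  # (token_text, start, end) with end exclusive
    labels: list[str],                   # e.g. "B-gene", "I-gene", "O"
) -> list[tuple[int, int, str]]:
    """Merge per-token BIO labels into (start, end, entity_class) spans."""
    spans = []
    current = None  # [start, end, entity_class] of the entity being built

    for (_, start, end), label in zip(tokens, labels):
        prefix, _, cls = label.partition("-")
        starts_new = prefix == "B" or (
            # An "I" with no open entity, or with a different class, opens a new one.
            prefix == "I" and (current is None or current[2] != cls)
        )
        if starts_new:
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, cls]
        elif prefix == "I":
            current[1] = end  # extend the open entity to cover this token
        else:
            # "O" (or any unrecognized label) closes the open entity.
            if current is not None:
                spans.append(tuple(current))
            current = None

    if current is not None:
        spans.append(tuple(current))
    return spans
```

Treating a stray "I-" label as the start of a new entity is a lenient design choice; a stricter variant could discard such tokens instead.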
Core Document Processing Pipeline
This component represents the central execution engine of KAZU, responsible for orchestrating the sequential application of various processing steps to documents. It defines the overall flow for transforming raw text into annotated and linked information.
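The sequential flow can be pictured with a short sketch. It assumes each step is a callable that takes a list of documents and returns the documents together with any that it failed to process; Doc, Step, run_pipeline, and the two example steps are hypothetical and only illustrate the ordering, not KAZU's actual pipeline classes.

```python
from typing import Callable

Doc = dict  # hypothetical document representation for this sketch
Step = Callable[[list[Doc]], tuple[list[Doc], list[Doc]]]  # returns (docs, failed_docs)


def run_pipeline(docs: list[Doc], steps: list[Step]) -> tuple[list[Doc], dict[str, list[Doc]]]:
    """Apply each step in order and record which documents each step failed on."""
    failures: dict[str, list[Doc]] = {}
    for step in steps:
        docs, failed = step(docs)
        if failed:
            failures[step.__name__] = failed
    return docs, failures


def tokenize(docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
    """Toy step: split text on whitespace."""
    for doc in docs:
        doc["tokens"] = doc["text"].split()
    return docs, []


def tag_uppercase(docs: list[Doc]) -> tuple[list[Doc], list[Doc]]:
    """Toy step: treat all-uppercase tokens as entities; needs tokenize to run first."""
    failed = []
    for doc in docs:
        if "tokens" not in doc:
            failed.append(doc)
            continue
        doc["entities"] = [token for token in doc["tokens"] if token.isupper()]
    return docs, failed


docs, failures = run_pipeline([{"text": "EGFR drives lung cancer"}], [tokenize, tag_uppercase])
```

Collecting failures per step, rather than raising on the first error, lets a batch continue while still showing where individual documents went wrong.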
KAZU Data Models
This component defines the fundamental data structures used throughout the KAZU system to represent linguistic and semantic information. It includes classes for documents, sections, entities, character spans, and mappings, providing a standardized way to handle annotated text.
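The relationship between these structures can be illustrated with plain dataclasses. The field names below are assumptions chosen for the sketch and are not KAZU's actual class definitions; only the nesting (documents contain sections, sections contain entities, entities carry character spans and mappings) follows the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CharSpan:
    start: int
    end: int  # exclusive


@dataclass
class Mapping:
    """Link from an entity to an identifier in some knowledge source."""
    source: str
    idx: str
    confidence: float


@dataclass
class Entity:
    match: str
    entity_class: str
    spans: tuple[CharSpan, ...]  # more than one span allows non-contiguous entities
    mappings: list[Mapping] = field(default_factory=list)


@dataclass
class Section:
    name: str
    text: str
    entities: list[Entity] = field(default_factory=list)


@dataclass
class Document:
    idx: str
    sections: list[Section] = field(default_factory=list)

    def all_entities(self) -> list[Entity]:
        """Flatten entities across every section of the document."""
        return [entity for section in self.sections for entity in section.entities]


section = Section(name="abstract", text="EGFR mutations in lung cancer")
section.entities.append(
    Entity(match="EGFR", entity_class="gene", spans=(CharSpan(0, 4),))
)
doc = Document(idx="doc-1", sections=[section])
```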
General Utilities
This component provides common utility functions used across the KAZU system. These include helpers for grouping data and for building and testing model packs, which are used to manage and deploy trained models.
Web Integration Utilities
This component contains utility functions for web-related features, particularly the integration with Label Studio. It provides methods for setting up default configurations and supporting web-based interactions.